AITopics | relu mlp

Collaborating Authors

relu mlp

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks

Neural Information Processing SystemsMar-21-2026, 03:13:46 GMT

Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HP) such as initialization scales, learning rates etc., only give indirect control over this behavior. In this paper, we introduce a key notion to predict and control feature learning: the angle $\theta_\ell$ between the feature updates and the backward pass (at layer index $\ell$). We show that the magnitude of feature updates after one GD step, at any training time, can be expressed via a simple and general *feature speed formula* in terms of this angle $\theta_\ell$, the loss decay, and the magnitude of the backward pass. This angle $\theta_\ell$ is controlled by the conditioning of the layer-to-layer Jacobians and at random initialization, it is determined by the spectrum of a certain kernel, which coincides with the Neural Tangent Kernel when $\ell=\text{depth}$. Given $\theta_\ell$, the feature speed formula provides us with rules to adjust HPs (scales and learning rates) so as to satisfy certain dynamical properties, such as feature learning and loss decay. We investigate the implications of our approach for ReLU MLPs and ResNets in the large width-then-depth limit. Relying on prior work, we show that in ReLU MLPs with iid initialization, the angle degenerates with depth as $\cos(\theta_\ell)=\Theta(1/\sqrt{\ell})$. In contrast, ResNets with branch scale $O(1/\sqrt{\text{depth}})$ maintain a non-degenerate angle $\cos(\theta_\ell)=\Theta(1)$. We use these insights to recover key properties of known HP scalings (such as $\mu$P), and also introduce a new HP scaling for large depth ReLU MLPs with favorable theoretical properties.

artificial intelligence, machine learning, proceedings, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.38)

Add feedback

The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks

Neural Information Processing SystemsFeb-15-2026, 20:08:34 GMT

Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HP) such as initialization scales, learning rates etc., only give indirect

artificial intelligence, initialization, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MobILE: Model-BasedImitationLearning From ObservationAlone

Neural Information Processing SystemsFeb-11-2026, 20:22:52 GMT

Weprovide aunified analysis for MobILE, and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well studied notions of structural complexity. We also show that the ILFO problem isstrictly harder than the standard IL problem by presenting an exponential sample complexity separation between ILand ILFO.

ath, machine learning, reinforcement learning, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)

Add feedback

a8f2713b5c6bdcd3d264f1aa9b9c6f03-Supplemental-Conference.pdf

Neural Information Processing SystemsFeb-11-2026, 05:52:57 GMT

computer vision, relu mlp, samurai, (13 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.05)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.70)

Add feedback

Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

Han, Yankun

arXiv.org Artificial IntelligenceOct-13-2025

Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observe depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.09423

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks

Neural Information Processing SystemsOct-11-2025, 00:27:05 GMT

Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HP) such as initialization scales, learning rates etc., only give indirect

architecture, initialization, resnet, (13 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

d705dd6e77decdc399162d6d5b92f6e8-Supplemental-Conference.pdf

Neural Information Processing SystemsOct-9-2025, 08:48:07 GMT

artificial intelligence, machine learning, relu mlp, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (0.52)
Information Technology > Artificial Intelligence > Machine Learning (0.48)

Add feedback

Quantifying The Limits of AI Reasoning: Systematic Neural Network Representations of Algorithms

Kratsios, Anastasis, Zvigelsky, Dennis, Hart, Bradd

arXiv.org Artificial IntelligenceSep-17-2025

A main open question in contemporary AI research is quantifying the forms of reasoning neural networks can perform when perfectly trained. This paper answers this by interpreting reasoning tasks as circuit emulation, where the gates define the type of reasoning; e.g. Boolean gates for predicate logic, tropical circuits for dynamic programming, arithmetic and analytic gates for symbolic mathematical representation, and hybrids thereof for deeper reasoning; e.g. higher-order logic. We present a systematic meta-algorithm that converts essentially any circuit into a feedforward neural network (NN) with ReLU activations by iteratively replacing each gate with a canonical ReLU MLP emulator. We show that, on any digital computer, our construction emulates the circuit exactly--no approximation, no rounding, modular overflow included--demonstrating that no reasoning task lies beyond the reach of neural networks. The number of neurons in the resulting network (parametric complexity) scales with the circuit's complexity, and the network's computational graph (structure) mirrors that of the emulated circuit. This formalizes the folklore that NNs networks trade algorithmic run-time (circuit runtime) for space complexity (number of neurons). We derive a range of applications of our main result, from emulating shortest-path algorithms on graphs with cubic--size NNs, to simulating stopped Turing machines with roughly quadratically--large NNs, and even the emulation of randomized Boolean circuits. Lastly, we demonstrate that our result is strictly more powerful than a classical universal approximation theorem: any universal function approximator can be encoded as a circuit and directly emulated by a NN.

artificial intelligence, machine learning, neural network, (13 more...)

arXiv.org Artificial Intelligence

2508.18526

Country: North America > Canada > Ontario (0.28)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Immunology (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Dataset distillation for memorized data: Soft labels can leak held-out teacher knowledge

Behrens, Freya, Zdeborová, Lenka

arXiv.org Artificial IntelligenceJun-18-2025

Dataset distillation aims to compress training data into fewer examples via a teacher, from which a student can learn effectively. While its success is often attributed to structure in the data, modern neural networks also memorize specific facts, but if and how such memorized information is can transferred in distillation settings remains less understood. In this work, we show that students trained on soft labels from teachers can achieve non-trivial accuracy on held-out memorized data they never directly observed. This effect persists on structured data when the teacher has not generalized.To analyze it in isolation, we consider finite random i.i.d. datasets where generalization is a priori impossible and a successful teacher fit implies pure memorization. Still, students can learn non-trivial information about the held-out data, in some cases up to perfect accuracy. In those settings, enough soft labels are available to recover the teacher functionally - the student matches the teacher's predictions on all possible inputs, including the held-out memorized data. We show that these phenomena strongly depend on the temperature with which the logits are smoothed, but persist across varying network capacities, architectures and dataset compositions.

artificial intelligence, machine learning, student, (18 more...)

arXiv.org Artificial Intelligence

2506.14457

Country: North America > United States (0.46)

Genre: